Context

House Rocket is a digital platform whose business model is the purchase and sale of real estate using technology.

As a Data Scientist contracted by the company, your job is to help find the best business opportunities in the real estate market. House Rocket's CEO would like to maximize the company's revenue by finding good business opportunities.

The main strategy is to buy good houses in great locations at low prices and then resell them later at higher prices. The greater the difference between the purchase price and the sale price, the greater the company's profit and therefore the greater its revenue.

However, houses have many attributes that make them more or less attractive to buyers and sellers, and location and time of year can also drive prices.

  1. Which houses should the CEO of House Rocket buy and at what purchase price?
  2. Once a house is owned by the company, what is the best time to sell it and what would be the sale price?
  3. Should House Rocket renovate a house to raise its sale price? What changes would be suggested? What price increase does each refurbishment option bring?

Idea taken from: https://sejaumdatascientist.com/os-5-projetos-de-data-science-que-fara-o-recrutador-olhar-para-voce/

Import libs

Loading and performing a descriptive analysis of the data.

Clean the data

Treat outliers and skewness

The sqft_basement column is zero in 60.7% of the rows. This indicates that most houses do not have a basement. Therefore, we cannot treat these values as outliers, and we cannot correct the skewness of this column with a log transform either.
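A minimal sketch of the two checks behind this decision, on a small hypothetical sample (the values below are invented for illustration and are not taken from the dataset): the share of zero values in sqft_basement, and the effect of a log transform on a strictly positive skewed column such as sqft_living.

```python
import numpy as np
import pandas as pd

# Hypothetical toy sample; the notebook itself works on the full house sales data.
df = pd.DataFrame({
    "sqft_basement": [0, 0, 0, 400, 0, 850, 0, 0, 1200, 0],
    "sqft_living": [900, 1200, 1500, 2100, 1100, 3000, 1300, 1000, 5200, 1250],
})

# Share of rows where the basement area is exactly zero.
zero_share = (df["sqft_basement"] == 0).mean()

# A plain log is undefined at zero, so it only applies to strictly positive columns.
skew_before = df["sqft_living"].skew()
skew_after = np.log(df["sqft_living"]).skew()

print(f"zero share in sqft_basement: {zero_share:.1%}")
print(f"sqft_living skewness: {skew_before:.2f} -> {skew_after:.2f} after log")
```

When a column is mostly zeros, the zeros carry real information (no basement), which is why it is left untransformed here.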

Assumptions about Business Behavior.

Exploratory Data Analysis

Correlation

In this notebook, we use the Pearson correlation. This coefficient can take any value in the range [-1, 1]. Its sign indicates the direction of the relationship, while its magnitude (how close it is to -1 or +1) indicates the strength of the relationship.

The strength can be assessed by these general guidelines:

This text was extracted from https://libguides.library.kent.edu/SPSS/PearsonCorr#cite_cohen

In our case, we are interested in medium to strong correlations.
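As a rough sketch of this filtering step: the snippet below computes Pearson correlations against price and keeps the medium and strong ones. The cut-offs 0.1, 0.3 and 0.5 are the commonly cited Cohen thresholds; how the boundary values are assigned is an assumption made here, the `strength` helper is hypothetical, and the sample values are invented.

```python
import pandas as pd

# Hypothetical toy sample with a few of the dataset's column names.
df = pd.DataFrame({
    "price": [200000, 310000, 390000, 520000, 600000],
    "sqft_living": [1000, 1500, 2000, 2500, 3000],
    "condition": [3, 3, 4, 3, 5],
})

# Pearson correlation of every feature with the target.
corr_with_price = df.corr(method="pearson")["price"].drop("price")

def strength(r):
    """Label a correlation by magnitude (assumed boundary handling)."""
    magnitude = abs(r)
    if magnitude >= 0.5:
        return "strong"
    if magnitude >= 0.3:
        return "medium"
    if magnitude >= 0.1:
        return "weak"
    return "negligible"

labels = corr_with_price.apply(strength)

# Keep only medium to strong correlations, in either direction.
selected = corr_with_price[corr_with_price.abs() >= 0.3]
print(pd.DataFrame({"r": corr_with_price, "strength": labels}))
```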

Are houses with many rooms more expensive?

On average, the more rooms, the higher the home's selling price. In the graph below, it is possible to see the relationship between the average sale price and the number of rooms.

From how many rooms does the price start to increase? What is the price increase for each room added?

To identify how much the value of a house increases with the number of rooms, we take the average price for each number of rooms and subtract consecutive averages.

Example: Suppose a house has two bedrooms and we want to know how much value is added, on average, by adding another bedroom. The average increase is the average price of houses with three bedrooms minus the average price of houses with two bedrooms.

Therefore, the biggest increase in house value comes from adding a fourth bedroom.
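The subtraction of averages described above can be sketched with a group-by followed by a difference. The prices below are hypothetical and chosen only to keep the arithmetic easy to follow; they do not reproduce the notebook's figures.

```python
import pandas as pd

# Hypothetical toy sample: two houses for each bedroom count.
df = pd.DataFrame({
    "bedrooms": [2, 2, 3, 3, 4, 4],
    "price": [300000, 340000, 400000, 420000, 600000, 640000],
})

# Average sale price for each number of bedrooms, in ascending order.
mean_price = df.groupby("bedrooms")["price"].mean().sort_index()

# Difference between consecutive averages: the value added by one more bedroom.
increase = mean_price.diff()

# Bedroom count whose addition brings the largest average increase.
best = increase.idxmax()
print(increase)
```

Note that `diff()` compares adjacent rows, so it assumes there is no gap in the bedroom counts.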

As the correlation analysis showed, price tends to increase as the number of bathrooms increases.

As the correlation analysis showed, price tends to increase as sqft_living increases.

In line with the correlation analysis, the mean price tends to increase as view increases. One exception is that houses with a view value of 1 are, in general, more expensive than houses with a view value of 2.

In line with the correlation analysis, the mean price tends to increase as grade increases.

In line with the correlation analysis, the mean price tends to increase as sqft_above increases.

In which region are the most expensive houses?

To answer this question, we took a geographic map of the King County region and plotted the sale price of the houses by location. The map can be found at the following link:

King County

The map shows that houses further north had, on average, the highest sale values. In addition, homes located near the lake had a higher sale value than homes elsewhere.

Are waterfront houses more expensive?

Yes, waterfront houses are on average more expensive than the other houses, as can be seen in the graph below.

Which house conditions are more expensive?

Houses with condition 1 or 2 have, on average, a lower sale value than the others.

Are newer houses more expensive?

Houses built in the periods 1900-1930 and 1990-2010 have the highest sale values, as can be seen in the following chart.

Are renovated houses more expensive?

On average, the houses that were renovated had a higher sale value than other houses, as can be seen in the graph below.

Of the houses that have been renovated, which house is the most expensive?

The houses that were renovated in the 1980s are more expensive than the others, as can be seen in the graph below.
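A possible way to produce both renovation comparisons is sketched below. It assumes that yr_renovated holds 0 for houses that were never renovated; the sample rows and the `renovation_decade` column name are hypothetical.

```python
import pandas as pd

# Hypothetical toy sample; yr_renovated == 0 is assumed to mean "never renovated".
df = pd.DataFrame({
    "yr_renovated": [0, 0, 1985, 1987, 2005, 0],
    "price": [400000, 420000, 900000, 800000, 600000, 380000],
})

# Flag renovated houses and compare average sale prices.
df["renovated"] = (df["yr_renovated"] > 0).map(
    {True: "renovated", False: "not renovated"}
)
mean_by_renovated = df.groupby("renovated")["price"].mean()

# Among renovated houses only, group by the decade of the renovation.
renovated = df[df["yr_renovated"] > 0].copy()
renovated["renovation_decade"] = (renovated["yr_renovated"] // 10) * 10
mean_by_decade = renovated.groupby("renovation_decade")["price"].mean()

print(mean_by_renovated)
print(mean_by_decade)
```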

Is there a period when houses sell for more?

Houses sold in April had a higher sale value than those sold in other months. When we repeated the analysis by season of the year, we found that homes sold in the spring had the highest value. The graphs below show the results obtained.
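The monthly and seasonal comparison could be computed along these lines. The date strings follow a `YYYYMMDDT000000` layout, and the month-to-season mapping (Northern Hemisphere, March to May as spring) is an assumption, since the notebook's own definition is not shown; `season_of` is a hypothetical helper and the prices are invented.

```python
import pandas as pd

# Hypothetical toy sample of sale dates and prices.
df = pd.DataFrame({
    "date": ["20140502T000000", "20140715T000000", "20141013T000000",
             "20150110T000000", "20150420T000000"],
    "price": [550000, 480000, 450000, 430000, 600000],
})

# Parse the sale date and extract the month.
df["date"] = pd.to_datetime(df["date"], format="%Y%m%dT%H%M%S")
df["month"] = df["date"].dt.month

def season_of(month):
    """Map a month number to a season (assumed Northern Hemisphere convention)."""
    if month in (3, 4, 5):
        return "spring"
    if month in (6, 7, 8):
        return "summer"
    if month in (9, 10, 11):
        return "autumn"
    return "winter"

df["season"] = df["month"].map(season_of)

# Average sale price per month and per season.
price_by_month = df.groupby("month")["price"].mean()
price_by_season = df.groupby("season")["price"].mean()
print(price_by_month)
print(price_by_season)
```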

Recommendations for the CEO of House Rocket

Which houses should the CEO of House Rocket buy and at what purchase price?

Based on our analysis, I recommend buying three-bedroom houses, since a renovation that adds one or more bedrooms would add considerable value at sale. The houses should have between 1.75 and 2.5 bathrooms, as these have an average value at the time of purchase. Prefer houses further north, as these tend to be more expensive and would therefore increase our profit margin, and houses overlooking the lake or the sea. Finally, consider houses built between 1940 and 1980, as they have the lowest purchase prices. Note that the house should not have been previously renovated, as houses that have not been renovated have a lower sale value. The mean purchase price would be \$1,915,636.36.

Once a house is owned by the company, what is the best time to sell it and what would be the sale price?

The best time to sell would be July, at a sale price of \$2,600,000.00, which would mean an increase of \$1,230,800.00.

Should House Rocket renovate a house to raise its sale price? What changes would be suggested? What price increase does each refurbishment option bring?

Yes. We showed previously that houses that have been renovated have a higher value than houses that have not. I suggest adding a bedroom in the renovation; on average, as shown above, this would increase our sale value by \$169,187.42.

Prediction Models

Let's use algorithms with their default settings. Then we will use the H2O framework to analyze a larger number of algorithms, as well as choose their best parameters. Before performing the prediction, it is necessary to normalize the distribution of the target variable.

As the independent variables are in different ranges, it is necessary to bring them to a common scale. For this, we use scikit-learn's StandardScaler.
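The preparation described above, followed by a default-settings fit of the five algorithms in this section, might look like the sketch below. The data is synthetic, `np.log1p` stands in for whichever log variant the notebook uses, and the train/test split and R² scoring are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the house data (hypothetical relationship).
rng = np.random.default_rng(0)
sqft_living = rng.uniform(500, 5000, size=200)
grade = rng.integers(3, 13, size=200)
X = np.column_stack([sqft_living, grade])
price = 100 * sqft_living + 40000 * grade + rng.normal(0, 20000, size=200)

# Log-transform the target to reduce its skewness.
y_log = np.log1p(price)

X_train, X_test, y_train, y_test = train_test_split(
    X, y_log, test_size=0.25, random_state=0
)

# Default settings; random_state is fixed only to make the run repeatable.
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(random_state=0),
    "KNN": KNeighborsRegressor(),
    "SVM": SVR(),
}

scores = {}
for name, model in models.items():
    # The scaler is fitted on the training split only, inside the pipeline.
    pipeline = make_pipeline(StandardScaler(), model)
    pipeline.fit(X_train, y_train)
    scores[name] = pipeline.score(X_test, y_test)  # R^2 on the log scale

print(scores)
```

Putting the scaler inside the pipeline keeps test-set statistics out of the fitting step.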

Linear Regression

Decision Tree

Random Forest

KNN

SVM

H2O framework

We obtained better results with H2O, even though only 30 algorithms were tested. However, it is still possible to improve the results, either by using genetic algorithms to choose the best hyperparameters for the previous algorithms or by increasing the number of algorithms tested in the H2O framework. It is worth noting that in real cases the logarithmic transformation applied to the price variable would have to be reversed.
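One way to reverse the transformation is sketched here with scikit-learn's `TransformedTargetRegressor`, which applies the log before fitting and its inverse at prediction time. The linear model, the `log1p`/`expm1` pair and the sample values are assumptions for illustration, not the notebook's actual setup.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: living area in square feet and sale price in dollars.
X = np.array([[1000.0], [1500.0], [2000.0], [2500.0], [3000.0]])
price = np.array([200000.0, 310000.0, 390000.0, 520000.0, 600000.0])

# The model is trained on log1p(price); predictions come back through expm1.
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(X, price)
predicted_price = model.predict(X)  # already on the original dollar scale

# Equivalent manual version: fit on the log scale, then invert by hand.
manual = LinearRegression().fit(X, np.log1p(price))
manual_price = np.expm1(manual.predict(X))

print(predicted_price.round(2))
```

Errors such as MAE or RMSE are easier to interpret when computed on prices returned to the original scale.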